ABSTRACT



Person-fit research in the context of paper-and-pencil tests 
is reviewed, and some specific problems regarding person fit in the context 
of computerized adaptive testing (CAT) are discussed. Some new methods are 
proposed to investigate person fit in a CAT environment. These statistics are 
based on Statistical Process Control (SPC) theory. A technique from SPC that 
is effective in detecting small shifts in the mean of the variable being 
measured is the cumulative sum (CUSUM) procedure. How CUSUM is applied to 
CATs is outlined, and eight statistics are proposed to investigate the sum of 
consecutive negative or positive residuals. Two simulation studies evaluated 
the use of these eight statistics. Results show that the detection rates of 
these statistics are dependent on the type of aberrance simulated. Conditions 
under which the statistics can be used or should not be used are discussed. 
(Contains 2 tables and 26 references.) (SLD)

Person Fit Based on Statistical Process Control in an Adaptive Testing Environment



Introduction 

The aim of computerized adaptive testing (CAT) is to construct an optimal test for an 
individual examinee. To achieve this, the ability of the examinee is estimated during test 
administration and items are selected that match the current ability estimate. This is done 
using an item response theory (IRT) model that is assumed to describe an examinee's response
behavior. It is questionable, however, whether the assumed IRT model gives a good description
of each individual's test behavior. For those individuals for whom this is not the case, the
current ability estimate may be an inadequate measure of the ability level and, as a result, the
construction of an optimal test may be difficult.

Many causes may invalidate an ability estimate. For example,
examinees may take a CAT to familiarize themselves with the questions to be asked and 
randomly guess the correct answers on almost all items in the test, or examinees may 
have preknowledge of some of the items in the item pool and correctly answer these items 
independent of their trait level and the item characteristics. These types of aberrant behavior 
may invalidate the ability estimate and it seems therefore useful to investigate the fit of a score 
pattern to the test model. Research with respect to methods that provide information about the 
fit of a score pattern to a test model is usually referred to as appropriateness measurement or 
person-fit measurement. Most studies in this area are, however, in the context of paper-and- 
pencil (P&P) tests. As will be argued below, the application of person-fit theory presented in 
the context of P&P tests cannot simply be generalized to CAT. 

The aim of this article is to first give an introduction to existing person-fit research in 
the context of P&P tests, then to discuss some specific problems regarding person fit in the 
context of CAT, and finally to explore some new methods to investigate person-fit in a CAT 
environment. 



Person Fit and Paper and Pencil Tests 

Three methods have been proposed to investigate the fit of an examinee to an IRT model: 
person-fit statistics, person-fit tests, and the person response function. 

In IRT, the probability of obtaining a correct answer on item $i$ ($i = 1, \ldots, n$) is
explained by an examinee's latent trait value ($\theta$) and the characteristics of the item (Hambleton
& Swaminathan, 1985). Let $U_i$ denote the binary (0, 1) response to item $i$, $a_i$ the item
discrimination parameter, $b_i$ the item difficulty parameter, and $c_i$ the item guessing parameter.
The probability of correctly answering an item according to the three-parameter logistic IRT
model (3PLM) is defined by

$$P(U_i = 1 \mid \theta) = P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}.$$

When $c_i = 0$, the 3PLM becomes the two-parameter logistic IRT model (2PLM):

$$P_i(\theta) = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \tag{1}$$
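The two response models can be written as one small function. The following is a minimal sketch; the function name `p_3pl` and its argument order are illustrative choices, not part of the original text.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PLM.

    With c = 0 this reduces to the 2PLM of Equation 1.
    """
    z = a * (theta - b)
    logistic = 1.0 / (1.0 + math.exp(-z))  # same value as exp(z) / (1 + exp(z))
    return c + (1.0 - c) * logistic

# An examinee whose trait level equals the item difficulty.
print(p_3pl(0.0, 1.0, 0.0))         # 2PLM
print(p_3pl(0.0, 1.0, 0.0, c=0.2))  # 3PLM with a guessing parameter of 0.2
```

Writing the logistic term as `1 / (1 + exp(-z))` is algebraically identical to the form in the equations and avoids computing the exponential twice.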



Person-Fit Statistics 



Most person-fit research has been conducted using fit statistics. A general form in which
most person-fit statistics can be expressed is

$$\sum_{i=1}^{n} [U_i - P_i(\theta)]\, w_i(\theta), \tag{2}$$

where $w_i(\theta)$ is a suitable weight (see, e.g., Snijders, 1998). Note that, as a result of dichotomous
items, $U_i^2$ equals $U_i$; thus, statistics of the form

$$\sum_{i=1}^{n} [U_i - P_i(\theta)]^2\, v_i(\theta)$$

can be re-expressed as statistics of the form in Equation 2. The expected value of the statistic
equals 0 and often the variance is taken into account to obtain a standardized version of
the statistic. For example, Wright and Stone (1979) proposed a person-fit statistic based on
standardized residuals, where the weight

$$v_i(\theta) = \frac{1}{n P_i(\theta)[1 - P_i(\theta)]}$$

was taken, resulting in

$$V = \sum_{i=1}^{n} \frac{[U_i - P_i(\theta)]^2}{n P_i(\theta)[1 - P_i(\theta)]}. \tag{3}$$



V can be interpreted as the corrected mean of squared standardized residuals based on n items; 
relatively large values of V point to a deviant item score pattern. 
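As an illustration of Equation 3, the sketch below computes $V$ from a vector of item scores and the corresponding model probabilities. The function name `wright_stone_v` and the example values are hypothetical.

```python
def wright_stone_v(u, p):
    """Mean of squared standardized residuals (Equation 3).

    u: list of 0/1 item scores; p: model probabilities P_i(theta), each in (0, 1).
    """
    n = len(u)
    return sum((ui - pi) ** 2 / (n * pi * (1.0 - pi)) for ui, pi in zip(u, p))

p = [0.9, 0.7, 0.5, 0.3, 0.1]                # items ordered from easy to difficult
print(wright_stone_v([1, 1, 1, 0, 0], p))    # pattern in line with the model
print(wright_stone_v([0, 0, 1, 1, 1], p))    # easy items wrong, difficult items right
```

The second pattern yields the larger value, which matches the interpretation that large values of $V$ point to a deviant item score pattern.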

Most studies in the literature have been conducted using some suitable function of the log
likelihood function

$$l = \sum_{i=1}^{n} \{U_i \ln P_i(\theta) + [1 - U_i] \ln [1 - P_i(\theta)]\}.$$

This statistic, first proposed by Levine and Rubin (1979), was further developed by
Drasgow, Levine, and Williams (1985). Two problems exist when using $l$ as a fit statistic. The
first problem is that the numerical values of $l$ depend on the trait level. As a result, examinees
may or may not be classified as aberrant depending on the trait level. A second problem is that
for classifying a response pattern as aberrant, a distribution of the statistic is needed. Using
a distribution, the probability of exceedance or significance probability can be determined.
Because large negative values of $l$ indicate aberrance, the significance probabilities in the left
tail of the distribution are of interest.

To overcome both problems, Drasgow, Levine, and Williams (1985) proposed a
standardized version of $l$, $l_z$, that was less confounded with the trait level and was assumed
to be standard normally distributed for long tests due to the central limit theorem:

$$l_z = \frac{l - E(l)}{[\operatorname{var}(l)]^{1/2}},$$

where $E(l)$ and $\operatorname{var}(l)$ denote the expectation and variance of $l$, respectively. These quantities
are given by

$$E(l) = \sum_{i=1}^{n} \{P_i(\theta) \ln P_i(\theta) + [1 - P_i(\theta)] \ln [1 - P_i(\theta)]\},$$

and

$$\operatorname{var}(l) = \sum_{i=1}^{n} P_i(\theta)[1 - P_i(\theta)] \left[\ln \frac{P_i(\theta)}{1 - P_i(\theta)}\right]^2.$$



However, as argued by Molenaar and Hoijtink (1990), $l_z$ is standard normally distributed
for long tests when true $\theta$ is used, but in practice $\theta$ is replaced by its estimate $\hat{\theta}$, which affects the
distribution of $l_z$. This was shown in Molenaar and Hoijtink (1990) and also in van
Krimpen-Stoop and Meijer (in press). In both studies it was found that when maximum likelihood
estimation was used to estimate $\theta$, the variance of $l_z$ was smaller than expected under the
standard normal distribution. In particular for short tests and tests of moderate length (tests
of 50 items or less) the variance was found to be seriously reduced. As a result, the statistic
was very conservative in classifying score patterns as aberrant. In the context of the Rasch
(1960) model, Molenaar and Hoijtink (1990) proposed three approximations to the distribution
of $l$ conditional on the total score: using (1) complete enumeration, (2) Monte Carlo simulation,
and (3) a chi-square distribution, where the mean, standard deviation, and skewness of $l$ were
taken into account. These approximations are conditional on the total score, which in the Rasch
model is a sufficient statistic for $\theta$. Recently, Snijders (1998) derived the asymptotic distribution
for a family of person-fit statistics that are linear in the item responses and in which $\theta$ is replaced
by $\hat{\theta}$.

For a review of person-fit statistics, see Meijer and Sijtsma (1995). 
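The log-likelihood statistic $l$ and its standardized version $l_z$ defined above can be computed directly from the item scores and the model probabilities. This is a minimal sketch; the function name `lz_statistic` and the example probabilities are illustrative, and the probabilities are taken as given rather than estimated.

```python
import math

def lz_statistic(u, p):
    """Return (l, l_z) for item scores u and model probabilities p.

    Each probability must lie strictly between 0 and 1, and not all
    probabilities may equal 0.5 (the variance of l would then be zero).
    """
    l = sum(ui * math.log(pi) + (1 - ui) * math.log(1.0 - pi)
            for ui, pi in zip(u, p))
    expected = sum(pi * math.log(pi) + (1.0 - pi) * math.log(1.0 - pi) for pi in p)
    variance = sum(pi * (1.0 - pi) * math.log(pi / (1.0 - pi)) ** 2 for pi in p)
    return l, (l - expected) / math.sqrt(variance)

p = [0.8, 0.6, 0.3]
print(lz_statistic([1, 1, 0], p))  # likely pattern: l_z above zero
print(lz_statistic([0, 0, 1], p))  # unlikely pattern: l_z below zero
```

Large negative values of $l_z$ correspond to the left tail discussed in the text, which is where aberrant patterns are expected to fall.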



Person-Fit Tests 

A second way to investigate the fit of a score pattern is to test the null model of normal 
response behavior against an alternative model of aberrant response behavior. Levine and 
Drasgow (1988) proposed a method for the identification of aberrant persons that is
statistically optimal; that is, no other method can achieve a higher rate of detection at the same
Type I error rate. They calculated a likelihood ratio statistic that provides the most powerful test
for the null hypothesis that an item score pattern is normal versus the alternative hypothesis that
it is aberrant. Here, the researcher has to specify in advance a model for normal behavior and
a model that specifies a particular type of aberrant behavior. Klauer (1991, 1995) followed the
same strategy and used uniformly most powerful tests in the context of the Rasch model to test
against person-specific item discrimination and against violations of local stochastic
independence and unidimensionality.




Person Response Function

A third alternative to investigate person fit is to use a person response function (PRF). A
PRF gives the probability of a correct response for a person with a fixed $\theta$ as a function of $b$. In
IRT, the item response function is assumed to be a nondecreasing function of $\theta$, whereas the PRF
is assumed to be nonincreasing in $b$. The fit of an examinee's score pattern can be investigated by
comparing the observed and expected PRF. Large differences between observed and expected
PRFs may indicate nonfitting response behavior. Let $n$ items be ordered by their $b$-values and
let item rank numbers be assigned accordingly, such that

$$b_1 < b_2 < \ldots < b_n.$$

Furthermore, let $\hat{\theta}$ be the maximum likelihood estimate of $\theta$ under the 3PLM. Choose $n$ such that
$A_r$ ($r = 1, \ldots, R$) ordered classes can be formed, each containing $m$ items; thus $A_1 = \{1, \ldots, m\}$,
$A_2 = \{m+1, \ldots, 2m\}$, ..., $A_R = \{n-m+1, \ldots, n\}$. Within each class, for a particular $\hat{\theta}$ value
the difference of observed and expected correct scores is taken and this difference is divided by
the number of items in the subtest. Using $E(U_i \mid \hat{\theta}) = P_i(\hat{\theta})$, this yields

$$D_{A_r}(\hat{\theta}) = m^{-1} \sum_{i \in A_r} [U_i - P_i(\hat{\theta})], \quad \text{for all } r = 1, \ldots, R.$$

Next, the $D$s are added across subtests, which yields

$$D(\hat{\theta}) = \sum_{r=1}^{R} D_{A_r}(\hat{\theta}).$$



$D(\hat{\theta})$ was taken as a measure of an individual's fit to the model by Trabin and Weiss
(1983). For example, if an examinee copies the answers to the most difficult items, the scores
on the most difficult subtests are likely to be substantially higher than predicted by the expected
PRF. Nering and Meijer (1998) compared the PRF approach with the $l_z$ statistic using simulated
data and found that the detection rate of $l_z$ in most cases was higher than using the PRF. They
suggested that the PRF and $l_z$ can be used in a complementary way: misfitting score patterns
can be detected using $l_z$, and differences between expected and observed PRFs can be used to
retrieve more information at a subtest level. Klauer and Rettig (1990) statistically refined the
methodology of Trabin and Weiss (1983).
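The subtest-level comparison of observed and expected scores described above can be sketched as follows. The function name `prf_differences` is hypothetical, and the sketch assumes that the number of items is a multiple of the class size $m$, as the text requires.

```python
def prf_differences(u, p, b, m):
    """Return (D, per-class differences) for the PRF approach.

    u: 0/1 item scores; p: probabilities P_i(theta_hat); b: item difficulties;
    m: number of items per class. Items are first ordered by difficulty.
    """
    n = len(u)
    if n % m != 0:
        raise ValueError("the number of items must be a multiple of m")
    order = sorted(range(n), key=lambda i: b[i])  # item ranks by b-value
    per_class = []
    for start in range(0, n, m):
        members = order[start:start + m]
        per_class.append(sum(u[i] - p[i] for i in members) / m)
    return sum(per_class), per_class

# Two classes of two items: the easy class is answered correctly, the difficult one is not.
total, by_class = prf_differences([1, 1, 0, 0], [0.5] * 4, [-1.0, 0.0, 1.0, 2.0], m=2)
print(total, by_class)
```

Returning the per-class values alongside the sum reflects the suggestion in the text that the PRF is most useful for retrieving information at the subtest level.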
Person Fit in Computerized Adaptive Testing

To investigate the fit of a score pattern in a CAT one obvious option is to use one of 
the methods proposed for P&P tests. However, research conducted with these methods in a 
CAT showed that this is not straightforward. Nering (1997) evaluated the first four moments
of the distribution of $l_z$ for a CAT. His results were in concordance with the results using
P&P tests: the variance and the mean were smaller than expected and the distributions were
negatively skewed. As a result, the normal approximation in the tails of the distribution was
inaccurate. Van Krimpen-Stoop and Meijer (in press) simulated the distributions of $l_z$ and $l_z^*$,
an adapted version of $l_z$ in which the variance was corrected for $\hat{\theta}$ according to the theory
presented in Snijders (1998). They simulated item scores with a fixed set of administered items
and item scores generated according to a stochastic design, where the choice of the administered
items depended on the responses to the previous items administered. Results indicated that the
distributions of $l_z$ and $l_z^*$ differed substantially from the theoretical distribution, although the item
characteristics and the test length determined the severity of the difference. Glas, Meijer,
and van Krimpen (1998) adapted the person-fit tests discussed by Klauer (1995) to the 2PLM and
investigated the detection rate of these statistics in a CAT. They found very low detection rates
for most simulated types of aberrant response behavior: the detection rates varied between 0.01
and 0.24 at significance level $\alpha = 0.10$ (one-sided).

A possible explanation for these results is that the characteristics of a CAT are unfavorable 
for the assessment of person fit using existing person-fit statistics. The first problem is that a 
CAT contains relatively few items compared to a P&P test. Because the detection rate of a 
person-fit statistic is sensitive to test length - longer tests will result in higher detection rates 
(e.g., Meijer, Molenaar, & Sijtsma, 1993) - the detection rate for a CAT will, in general, be lower 
than for P&P tests. A second problem is that almost all person-fit statistics use the spread in the
item difficulties: generally speaking, aberrant response behavior consists of many 0 scores to
easy items and many 1 scores to difficult items. In a CAT, the spread in the item difficulties is
relatively modest: in particular at the end of the test when $\hat{\theta}$ is close to true $\theta$, items with similar
item difficulties will be selected and as a result it is difficult to distinguish normal from aberrant
score patterns.

An alternative to the use of P&P person-fit statistics is to use statistics that are especially 
designed for CAT. Few examples of research using such statistics exist. McLeod and Lewis
(1998) discussed a Bayesian approach for the detection of a specific kind of aberrant response
behavior, namely for examinees with preknowledge of the items. They proposed to use a
posterior log-odds ratio as a method for detecting item preknowledge in a CAT. However, the
effectiveness of this ratio to detect aberrant response patterns was unclear because of the absence
of a distribution and critical values of the log-odds ratio.

Below we will explore the usefulness of new methods to investigate person-fit in CAT. We 
will propose new statistics that can be used in a CAT and that are based on theory from Statistical 
Process Control. Also, the results of simulation studies investigating the critical values and 
detection rates of the statistics will be presented. 

Statistical Process Control 



In this section theory from Statistical Process Control (SPC) is introduced. SPC is often
used to control production processes. Consider, for example, the process of producing tea bags,
where each tea bag has a certain weight. Too much tea in each bag is undesirable because of
the increasing costs of material (tea) for the company. On the other hand, the customers will
complain when there is too little tea in the bags. Therefore, the weight of the tea bags needs to be
controlled during the production process. This can be done using techniques from SPC.

A (production) process is in a state of statistical control if the variable being measured has a
stable distribution. A technique from SPC that is effective in detecting small shifts in the mean
of the variable being measured is the cumulative sum (CUSUM) procedure, originally proposed
by Page (1954). In a CUSUM procedure, sums are accumulated, but only if they exceed "the
goal value" by more than $d$ units. Let $Z_t$ be the value of statistic $Z$ (e.g., the standardized mean
weight of tea bags) obtained from a sample of size $N$ at time point $t$. Furthermore, let $d$ be the
reference value, and $h$ some threshold. Then, the two-sided CUSUM procedure can be written
in terms of $C_t^H$ and $C_t^L$, where

$$\begin{aligned}
C_1^H &= \max[0, Z_1 - d], \\
C_2^H &= \max[0, (Z_2 - d) + C_1^H], \\
C_3^H &= \max[0, (Z_3 - d) + C_2^H], \\
&\;\;\vdots \\
C_t^H &= \max[0, (Z_t - d) + C_{t-1}^H],
\end{aligned}$$

and analogously

$$C_t^L = \min[0, (Z_t + d) + C_{t-1}^L],$$

with starting values $C_0^H = C_0^L = 0$. Note that the cumulations can be running on both sides
concurrently. The sum of consecutive positive values of $Z_t - d$ is reflected by $C_t^H$ and the
sum of consecutive negative values of $Z_t + d$ by $C_t^L$. Thus, as soon as $|Z_t| > d$ the CUSUM
procedure starts. The process is "out-of-control" when $C_t^H > h$ or $C_t^L < -h$, and "in-control"
otherwise. This means that, after a number of consecutive positive or negative values of the
statistic, the process can become "out-of-control".

One assumption underlying the CUSUM procedure is that the $Z_t$ values are
approximately standard normally distributed; the values of $d$ and $h$ are based on this assumption.
The value of $d$ is usually selected as one-half of the mean shift (in $Z_t$ units) one wishes to detect;
for example, $d = 0.5$ is the appropriate choice for detecting a shift of one times the standard
deviation of $Z_t$. In practice, CUSUM charts with $d = 0.5$ and $h = 4$ or $h = 5$ are often used
(for the rationale, see Montgomery, 1991, p. 295). Setting these values for $d$ and $h$
results in a significance level of approximately $\alpha = 0.0027$ (two-sided). Note that in person-fit
research $\alpha$ is fixed and critical values are derived from the distribution of the statistic. In this
study, we will also use a fixed $\alpha$ and derive critical values through simulation.
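The two-sided CUSUM recursion of this section can be sketched in a few lines. The function name `two_sided_cusum` and the return format are illustrative assumptions; the defaults $d = 0.5$ and $h = 5$ are the chart settings mentioned in the text.

```python
def two_sided_cusum(z, d=0.5, h=5.0):
    """Run the two-sided CUSUM on a sequence of standardized values z.

    Returns the list of (C_H, C_L) pairs and a flag that is True as soon as
    C_H > h or C_L < -h at some time point (process out of control).
    """
    c_high = c_low = 0.0            # starting values C_0^H = C_0^L = 0
    path = []
    out_of_control = False
    for z_t in z:
        c_high = max(0.0, (z_t - d) + c_high)
        c_low = min(0.0, (z_t + d) + c_low)
        path.append((c_high, c_low))
        if c_high > h or c_low < -h:
            out_of_control = True
    return path, out_of_control

# A sustained upward shift accumulates; small fluctuations around zero do not.
print(two_sided_cusum([1.5] * 6))
print(two_sided_cusum([0.2, -0.3, 0.4]))
```

Values with $|Z_t| \le d$ leave both sums at zero, which is why the procedure only "starts" once a value exceeds the reference value.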

CAT and Statistical Process Control 

CUSUM procedures investigate strings of positive and negative values of a statistic. 
Person-fit statistics are often defined in terms of the difference between observed and expected 
scores, see Equation 2. A commonly used statistic is $V$, the mean of the squared standardized
residuals based on $n$ items, defined in Equation 3. One of the drawbacks of $V$ is that negative
and positive residuals cannot be distinguished. In a CAT this distinction is of interest because a string
of negative or positive residuals may indicate aberrant behavior. For example, suppose an
examinee with an average $\theta$-value responds to a test and during test administration the examinee
becomes more and more careless because he or she becomes tired. As a result, in the first part of
the test the responses will be an alternation of zeros and ones, whereas in the second part of the
test more and more items are incorrectly answered due to carelessness; thus, in the second part
of the test, consecutive negative residuals may occur.

Sums of consecutive negative or positive residuals can be investigated using a CUSUM
procedure. This can be explained as follows. A CAT can be viewed as a multistage test, where
each item is a stage and each stage can be seen as a time point; at each stage a response to one
item is given. Let $i_k$ denote the $k$th item in the CAT; that is, $k$ is the stage of the CAT. Further, let
the statistic $T_k$ be a function of the residuals at stage $k$, $n$ the final test length, and let, without
loss of generality, the reference value $d$ be equal to 0. Below, some examples of statistic $T$
are proposed. For each examinee, at each stage $k$ of a CAT, the CUSUM procedure can be
determined as



$$C_k^H = \max[0, T_k + C_{k-1}^H], \tag{4}$$

$$C_k^L = \min[0, T_k + C_{k-1}^L], \text{ and} \tag{5}$$

$$C_0^H = C_0^L = 0, \tag{6}$$



where $C^H$ and $C^L$ reflect the sum of consecutive positive and negative residuals, respectively.
Let $UB$ and $LB$ be some appropriate upper and lower bound, respectively. Then, when
$C^H > UB$ or $C^L < LB$ the response pattern can be classified as not fitting the model;
otherwise, the response pattern is normal.

Some Person-Fit Statistics 

Let $S_k$ denote the set of items administered as the first $k$ items in the CAT and $R_k$ the set
of remaining items in the pool, that is, all items in the pool that are not in $S_{k-1}$; from $R_k$ the
$k$th item in the CAT is administered. A principle of CAT is that $\theta$ is estimated at each stage $k$
based on the responses to the previously administered items; that is, the items in set $S_{k-1}$. Let
$\hat{\theta}_{k-1}$ denote the estimated $\theta$ at stage $k - 1$ and $\hat{\theta}_n$ the final estimate of $\theta$. Then, based on this
value $\hat{\theta}_{k-1}$, the item for the next stage, $k$, is selected from $R_k$. The probability of correctly
answering item $i_k$, according to the 2PLM, evaluated at $\hat{\theta}_{k-1}$ can be written as

$$P_{i_k}(\hat{\theta}_{k-1}) = \frac{\exp[a_{i_k}(\hat{\theta}_{k-1} - b_{i_k})]}{1 + \exp[a_{i_k}(\hat{\theta}_{k-1} - b_{i_k})]}. \tag{7}$$



Two sets of four statistics, all corrected for test length and based on the difference between
observed and expected item scores, are proposed. The first four statistics, $T^1$ through $T^4$,
are proposed to investigate the sum of consecutive positive or negative residuals in an on-line
situation, when the test length of the CAT is fixed. These four statistics use as expected score the
probability of correctly answering the item, evaluated at the updated ability estimate, defined
in Equation 7. The other four statistics, $T^5$ through $T^8$, use as expected score the probability of
correctly answering the item, evaluated at the final ability estimate $\hat{\theta}_n$. As a result of using $\hat{\theta}_n$
instead of $\hat{\theta}_{k-1}$, the development of the accumulated residuals can no longer be investigated in an
on-line situation.

All statistics are based on the general form defined in Equation 2: a particular statistic is 
defined by choosing a particular weight. In line with the literature on person fit in P&P tests, 
see for example Equation 3, two statistics are proposed where the residual $U - P(\cdot)$ is weighted
by the estimated standard deviation, $[P(\cdot)(1 - P(\cdot))]^{1/2}$.

In order to construct person-fit statistics that are sensitive to nonfitting behavior in
the beginning part of the CAT, two statistics are proposed where the residual is weighted by
the reciprocal of the square root of the test information function containing the items administered
up to and including stage $k$. This information function is a monotone increasing function of
the stage of the CAT and, as a result, the residuals in the beginning of the test receive a larger
weight than the residuals in the last part of the CAT.

Also, two statistics are proposed to detect nonfitting behavior at the end of the CAT. Here,
the residuals are multiplied by the square root of the stage of the CAT, $\sqrt{k}$. Due to the increasing
function $\sqrt{k}$, the residuals at the beginning of the CAT are weighted less than residuals in the
later part of the CAT.

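Define

$$\begin{aligned}
T_k^2 &= T_k^1 \times \{P_{i_k}(\hat{\theta}_{k-1})[1 - P_{i_k}(\hat{\theta}_{k-1})]\}^{-1/2}, \\
T_k^3 &= T_k^1 \times \{I(\hat{\theta}_{k-1})\}^{-1/2}, \text{ and} \\
T_k^4 &= \sqrt{k} \times T_k^1,
\end{aligned}$$

where $I(\hat{\theta}_{k-1})$ is the test information function according to the 2PLM, of a test containing the
items administered up to and including stage $k$, evaluated at $\hat{\theta}_{k-1}$; that is,

$$I(\hat{\theta}_{k-1}) = \sum_{i \in S_k} a_i^2 P_i(\hat{\theta}_{k-1})[1 - P_i(\hat{\theta}_{k-1})].$$

Thus, $T_k^1$ is the residual of the response and the probability of a correct response to item $i_k$,
where the probability is evaluated at the estimated ability at the previous stage; $T_k^2$, $T_k^3$, and $T_k^4$
are functions of these residuals. Due to the use of the updated ability estimate, the sequential
nature of the CAT is taken into account.

Define

$$\begin{aligned}
T_k^6 &= T_k^5 \times \{P_{i_k}(\hat{\theta}_n)[1 - P_{i_k}(\hat{\theta}_n)]\}^{-1/2}, \\
T_k^7 &= T_k^5 \times \{I(\hat{\theta}_n)\}^{-1/2}, \text{ and} \\
T_k^8 &= \sqrt{k} \times T_k^5,
\end{aligned}$$

where $I(\hat{\theta}_n)$ is the test information function of a test containing the items administered up to
and including stage $k$, evaluated at $\hat{\theta}_n$; that is,

$$I(\hat{\theta}_n) = \sum_{i \in S_k} a_i^2 P_i(\hat{\theta}_n)[1 - P_i(\hat{\theta}_n)].$$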
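The weighted residual statistics can be illustrated with a short sketch. One point is an assumption on our part: the basic residual is taken here as $(U_{i_k} - P_{i_k})/n$, reading "corrected for test length" as division by the test length $n$; the text above does not give that expression explicitly. The function names are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """2PLM probability of a correct response (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def residual_statistics(u, a, b, theta_used):
    """Return one (T_base, T_sd, T_info, T_sqrt_k) tuple per stage k.

    u, a, b: scores and item parameters in order of administration.
    theta_used[k-1]: the ability estimate at which stage k is evaluated, i.e.
    the previous-stage estimate for T1..T4, or the final estimate repeated
    n times for T5..T8.
    """
    n = len(u)
    rows = []
    for k in range(1, n + 1):
        theta = theta_used[k - 1]
        p = p_2pl(theta, a[k - 1], b[k - 1])
        t_base = (u[k - 1] - p) / n  # ASSUMPTION: length correction is division by n
        # Test information of the items administered up to and including stage k.
        info = 0.0
        for i in range(k):
            p_i = p_2pl(theta, a[i], b[i])
            info += a[i] ** 2 * p_i * (1.0 - p_i)
        rows.append((t_base,
                     t_base / math.sqrt(p * (1.0 - p)),
                     t_base / math.sqrt(info),
                     math.sqrt(k) * t_base))
    return rows

u, a, b = [1, 0, 1], [1.0, 1.2, 0.8], [0.0, 0.5, -0.5]
on_line = residual_statistics(u, a, b, [0.0, 0.0, 0.3])  # previous-stage estimates
final = residual_statistics(u, a, b, [0.4] * 3)          # final estimate repeated
print(on_line[0], final[0])
```

Passing the vector of estimates explicitly keeps one function for both sets of statistics; only the evaluation point of the probabilities differs between them.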

The statistics $T^5$ through $T^8$ are proposed to investigate the sum of consecutive negative or
positive residuals, evaluated at the final estimate $\hat{\theta}_n$. As a result, the development of the
accumulated residuals can no longer be investigated in an on-line situation.

These eight statistics can be used in the CUSUM procedure described in Equations 4
through 6. As a result of the use of this procedure, the sum of positive and negative
residuals is updated after each item response.

The upper and lower bounds in a CUSUM procedure are usually based on the assumption that
the statistic is standard normally distributed. The statistics proposed here, however, are far
from standard normal: $T^1$ and $T^5$ follow a binomial distribution with only one observation, and
the other statistics are also based on only one observation. Therefore, the upper and lower bounds
were investigated through simulation, with the fixed values $\alpha = 0.05$ and $d = 0$.



The parametric CUSUM procedure investigates strings of correct or incorrect responses
in a CAT. A drawback of the CUSUM procedure is the absence of guidelines for determination
of the upper and lower bound for nonnormally distributed statistics. Therefore, in Study 1,
a simulation study was conducted to investigate the numerical values of the upper and lower
threshold of the CUSUM procedures using statistics $T^1$ through $T^8$ across $\theta$-levels. When these
bounds are independent of $\theta$, a fixed upper and lower bound for each statistic can be used. In
Study 2, the detection rate of the CUSUM procedures with the statistics $T^1$ through $T^8$ for
several types of nonfitting response behavior is investigated.

In this simulation study we use the 2PLM because it is less restrictive with respect
to empirical data than the one-parameter logistic model and it does not have the estimation
problems of the guessing parameter in the three-parameter logistic model (e.g., Baker, 1992,
pp. 109-112). The 2PLM has been shown to have a reasonable fit to several achievement and
personality data sets (e.g., Reise & Waller, 1990; Zickar & Drasgow, 1996). Furthermore, in these
two studies true item parameters were used. This is realistic when item parameters are estimated
using large samples: Molenaar and Hoijtink (1990) found no serious differences between true
and estimated item parameters for samples consisting of 1,000 examinees or more.



Study 1

Method

Five datasets consisting of 10,000 normal adaptive response vectors each were
constructed at five different $\theta$-levels: $\theta = -2$, $-1$, $0$, $1$, and $2$. An item pool of 400 items
fitting the 2PLM with $a_i \sim N(1; 0.2)$ and $b_i \sim U(-3; 3)$ was used to generate the adaptive
response vectors.

The normal response vector was simulated as follows. First, the true $\theta$ of a simulee was
set to a fixed $\theta$-level. Then, the first item of the CAT selected was the item with maximum
information given $\theta = 0$. For this item, $P(\theta)$ according to Equation 1 was determined. To
simulate the answer (1 or 0), a random number $y$ from the uniform distribution on the interval
[0, 1] was drawn; when $y < P(\theta)$ the response to item $i$ was set to 1 (correct response), 0
otherwise. The first four items of the CAT were selected with maximum information for $\theta = 0$,
and based on the responses to these four items, $\hat{\theta}$ was obtained using weighted maximum
likelihood estimation (Warm, 1989). The next item selected was the item with maximum
information given $\hat{\theta}$ at that stage. For this item, $P(\theta)$ was computed, a response was simulated,
$\hat{\theta}$ was estimated and another item was selected based on maximum information given $\hat{\theta}$ at that
stage. This procedure was repeated until the test attained the length of 30 items.

For each simulee, eight different statistics, $T^1$ through $T^8$, were used in the CUSUM
procedure described in Equations 4, 5, and 6. Then, for each simulee and for each statistic,

$$\max C^H = \max_k \left(C_k^H\right) \quad \text{and} \quad \min C^L = \min_k \left(C_k^L\right)$$

were determined, resulting in 10,000 values of $\max C^H$ and $\min C^L$ for each dataset and for each
statistic. Then, for each dataset and each statistic, the upper bound, $UB$, was determined as the
value of $\max C^H$ for which 2.5% of the simulees had higher $\max C^H$-values and the lower bound, $LB$,
was determined as the value of $\min C^L$ for which 2.5% of the simulees had lower $\min C^L$-values.
That is, a two-sided test at $\alpha = 0.05$ was conducted, where $P(\max C^H > UB) =
P(\min C^L < LB) = 0.025$. In other words, for each statistic two bounds (upper and lower
bound) per dataset were determined, where 5% of the simulees attained values outside these
bounds.
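The response-generation steps of the Method above can be sketched as follows. Several points are assumptions and not taken from the text: the second parameter of $N(1; 0.2)$ is read as a standard deviation, the ability estimate is found by a grid search that maximizes the log-likelihood plus half the log of the test information (used here as a stand-in for Warm's weighted maximum likelihood estimator under the 2PLM), and all function names are hypothetical.

```python
import math
import random

def p_2pl(theta, a, b):
    """2PLM probability of a correct response (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Item information under the 2PLM: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses, items, grid):
    """Grid search over log-likelihood + 0.5 * log(test information).

    ASSUMPTION: stand-in for the weighted ML estimator cited in the text.
    """
    def objective(theta):
        loglik = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            loglik += math.log(p) if u == 1 else math.log(1.0 - p)
        info = sum(item_info(theta, a, b) for a, b in items)
        return loglik + 0.5 * math.log(info)
    return max(grid, key=objective)

def simulate_cat(true_theta, pool, rng, length=30, n_start=4):
    """Simulate one model-fitting adaptive response vector.

    Returns item indices, responses, the estimate used at each stage,
    and the final ability estimate.
    """
    grid = [g / 20.0 for g in range(-80, 81)]  # -4.0 to 4.0 in steps of 0.05
    remaining = list(range(len(pool)))
    administered, responses, used = [], [], []
    theta_hat = 0.0                            # first n_start items are selected at theta = 0
    for stage in range(1, length + 1):
        best = max(remaining, key=lambda i: item_info(theta_hat, *pool[i]))
        remaining.remove(best)
        a, b = pool[best]
        y = rng.random()                       # uniform draw on [0, 1)
        responses.append(1 if y < p_2pl(true_theta, a, b) else 0)
        administered.append(best)
        used.append(theta_hat)                 # estimate available before this stage
        if stage >= n_start:
            theta_hat = estimate_theta(
                responses, [pool[i] for i in administered], grid)
    return administered, responses, used, theta_hat

rng = random.Random(2024)
pool = [(rng.gauss(1.0, 0.2), rng.uniform(-3.0, 3.0)) for _ in range(400)]
items, scores, used, final_estimate = simulate_cat(1.0, pool, rng)
print(scores, final_estimate)
```

The list `used` holds the estimate available before each stage, which is the evaluation point of the on-line statistics, while `final_estimate` is the evaluation point of the statistics based on the final estimate.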

When the bounds are stable across $\theta$, it becomes possible to use one fixed upper and one
fixed lower bound for each statistic. Therefore, to construct fixed bounds, the weighted averages
of the upper and lower bounds were calculated across $\theta$ for each statistic, with different weights
for different $\theta$-values; weights 0.05, 0.2, and 0.5, for $\theta = \pm 2$, $\pm 1$, and 0, respectively, were
used. These weights represent a realistic population distribution of abilities.
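The determination of the bounds and of their weighted averages can be sketched as follows. The function names are hypothetical, and the empirical quantile is taken by simple ranking, so with tied values the tail proportions are only approximately 2.5%.

```python
def empirical_bounds(max_ch, min_cl, alpha=0.05):
    """Return (UB, LB) such that alpha/2 of the simulees fall beyond each bound.

    max_ch: max C^H per simulee; min_cl: min C^L per simulee.
    """
    n = len(max_ch)
    tail = round(n * alpha / 2.0)          # number of simulees in each tail
    ub = sorted(max_ch)[n - tail - 1]      # 'tail' values lie above this one
    lb = sorted(min_cl)[tail]              # 'tail' values lie below this one
    return ub, lb

def weighted_bounds(bounds_by_theta, weights):
    """Weighted average of (UB, LB) pairs across theta-levels."""
    ub = sum(weights[t] * bounds_by_theta[t][0] for t in weights)
    lb = sum(weights[t] * bounds_by_theta[t][1] for t in weights)
    return ub, lb

# Weights from the text; the bounds per theta-level are made-up illustration values.
weights = {-2: 0.05, -1: 0.2, 0: 0.5, 1: 0.2, 2: 0.05}
bounds = {-2: (0.11, -0.10), -1: (0.12, -0.11), 0: (0.12, -0.12),
          1: (0.12, -0.11), 2: (0.10, -0.11)}
print(weighted_bounds(bounds, weights))
```

The weights sum to 1, so the result is a proper weighted mean that gives most influence to the bounds found at $\theta = 0$.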

Results 

In Table 1 the upper and lower bounds, at $\alpha = 0.05$ (two-sided), of statistics $T^1$ through
$T^8$ are tabulated at five different $\theta$-levels. Table 1 shows that, for most statistics, the upper and
lower bounds were stable across $\theta$-levels. For statistic $T^7$, the bounds were less stable across
$\theta$-values. Table 1 also shows that, for most statistics except $T^4$ and $T^8$, the weighted average
bounds were approximately symmetrical around 0.

As a result of the stable bounds for almost all statistics across $\theta$, one fixed upper and lower
bound can be taken as bounds for the CUSUM procedures.



Study 2 



Method 

To examine the detection rates for the eight proposed statistics, different types of nonfitting
response behavior were simulated. Aberrant item scores were simulated for all items, and
for the first or second part of the response pattern. Six datasets containing 1,000 nonfitting
adaptive response patterns were constructed; an item pool of 400 items with $a_i \sim N(1.0; 0.2)$
and $b_i \sim U(-3; 3)$ was used. The detection rate was defined as the proportion of nonfitting
response patterns that were classified as aberrant. For each response vector, CUSUM procedures
using $T^1$ through $T^8$ were performed; a response vector was classified as nonfitting when

$$\max C^H(T^q) > UB(T^q) \quad \text{or} \quad \min C^L(T^q) < LB(T^q)$$

for $q = 1, \ldots, 8$. The upper and lower bounds for each statistic were set to the weighted average
of the values presented in Table 1. To facilitate comparisons, a dataset containing 1,000
model-fitting adaptive response patterns was constructed and the percentage of normal response vectors
classified as aberrant was determined.
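The classification rule and the detection rate can be sketched as below, with the reference value $d = 0$ as in Equations 4 through 6. The function names and the numerical values in the example are made up for illustration.

```python
def cusum_extremes(t):
    """Return (max C^H, min C^L) over all stages for a sequence of T_k values."""
    c_high = c_low = 0.0
    max_high = min_low = 0.0
    for t_k in t:
        c_high = max(0.0, t_k + c_high)   # Equation 4
        c_low = min(0.0, t_k + c_low)     # Equation 5
        max_high = max(max_high, c_high)
        min_low = min(min_low, c_low)
    return max_high, min_low

def is_nonfitting(t, ub, lb):
    """Classify one response vector from its sequence of T_k values."""
    max_high, min_low = cusum_extremes(t)
    return max_high > ub or min_low < lb

def detection_rate(patterns, ub, lb):
    """Proportion of patterns classified as nonfitting."""
    flagged = sum(1 for t in patterns if is_nonfitting(t, ub, lb))
    return flagged / len(patterns)

run_of_positives = [0.1] * 5          # residuals of one sign accumulate
alternating = [0.1, -0.1] * 3         # residuals that cancel do not
print(detection_rate([run_of_positives, alternating], ub=0.3, lb=-0.3))
```

Applied to a dataset of model-fitting patterns, the same function gives the proportion of normal response vectors classified as aberrant.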



Types of aberrant response behavior 

Random response behavior. 

To investigate nonfitting behavior to all items of the test, random response behavior to
all items was simulated. This type of response behavior may be the result of guessing the
answers to the items of a test and was empirically studied by Van den Brink (1977). He described
persons who took a multiple-choice test only to familiarize themselves with the questions that
would be asked. Because returning an almost completely blank answer sheet may focus
a teacher's attention on the ignorance of the examinee, each examinee randomly guessed the
correct answers on almost all items of the test. "Guessing" simulees were assumed to answer
the items by randomly guessing the correct answers on each of the 30 items in the test with
a probability of 0.2. This probability corresponds to the probability of obtaining the correct
answer by guessing in a multiple-choice test with five alternatives per item.

Non-invariant ability. 

To investigate nonfitting response behavior in the first or in the second half of the CAT, 
response vectors with a two-dimensional $\theta$ were simulated (Klauer, 1991). It was assumed 
that during the first half of the test an examinee had a different $\theta$ value than during the second 
half. Carelessness, fumbling, or memorization of some items can be the cause of non-invariant 
abilities. Two datasets containing response vectors with a two-dimensional $\theta$ were simulated 
by drawing two ability values, $\theta_1$ and $\theta_2$, from a bivariate standard normal distribution; the 
correlation between the two values was modeled by the parameter $\rho$. Thus, during the first half 
of the test $P(\theta_1)$ was used and during the second half $P(\theta_2)$ was used to simulate the responses 
to the items. The values $\rho = 0$ and $\rho = 0.5$ were used here to simulate the response patterns. 
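
The sketch below illustrates this simulation under simplifying assumptions. It uses a fixed, hypothetical list of item parameters instead of adaptive item selection, and it assumes the common 2PLM parameterization with logit $a(\theta - b)$; the function names are illustrative only.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_abilities(rho: float, rng: random.Random) -> Tuple[float, float]:
    """Draw (theta1, theta2) from a bivariate standard normal with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    # theta2 is built from z1 and an independent z2 so that corr(theta1, theta2) = rho
    return z1, rho * z1 + math.sqrt(1.0 - rho ** 2) * z2


def p_2plm(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the assumed 2PLM form."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_two_theta(
    items: Sequence[Tuple[float, float]], rho: float, rng: random.Random
) -> List[int]:
    """Use theta1 for the first half of the items and theta2 for the second half."""
    theta1, theta2 = draw_abilities(rho, rng)
    half = len(items) // 2
    responses = []
    for k, (a, b) in enumerate(items):
        theta = theta1 if k < half else theta2
        responses.append(1 if rng.random() < p_2plm(theta, a, b) else 0)
    return responses


# Usage with 30 hypothetical items (a = 1, b = 0) and rho = 0.5.
example = simulate_two_theta([(1.0, 0.0)] * 30, rho=0.5, rng=random.Random(3))
```

In an actual CAT simulation the items would be selected on the basis of the running ability estimate, so the fixed item list here only shows how the ability switches halfway through the test.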

Violations of local stochastic independence. 

To examine aberrance in the second part of the CAT, response vectors with violations of 
local stochastic independence between all items in the test were simulated. When previous 
items provide new insights that are useful for answering the next item, or when the process of 
answering the items is exhausting, the assumption of local independence between the items may 
be violated. A generalization of a model proposed by Jannarone (1986) (see Glas, Meijer, and 
van Krimpen, 1998) was used to simulate response vectors with local dependence between 
all subsequent items 



where $\delta$ is a parameter modeling association between subsequent items (see Glas et al., 1998, for more 
details). Using this model, the probability of correctly answering an item is now determined 
by the item parameters $a$ and $b$, the person parameter $\theta$, and the association parameter $\delta$. When 
$\delta = 0$ the model equals the 2PLM. Compared to the 2PLM, positive values of $\delta$ result in a 
higher probability of a correct response (e.g., learning-effect), and negative values of $\delta$ result 
in a lower probability of correctly answering an item (e.g., carelessness). The values $\delta = -2$, 
$-1$, 1, and 2 were used to simulate nonfitting response patterns. 
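
The role of $\delta$ can be illustrated with a simplified stand-in for the dependence model. The sketch assumes, purely for illustration, that $\delta$ is added to the 2PLM logit of an item whenever the preceding item was answered correctly; this is not claimed to be the exact model of Glas et al. (1998), and the function names and item parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def p_correct(theta: float, a: float, b: float, delta: float = 0.0, prev: int = 0) -> float:
    """Response probability with an assumed dependence on the previous response.

    logit = a * (theta - b) + delta * prev, so delta = 0 (or prev = 0)
    reduces to the plain 2PLM probability.
    """
    logit = a * (theta - b) + delta * prev
    return 1.0 / (1.0 + math.exp(-logit))


def simulate_dependent(
    theta: float,
    items: Sequence[Tuple[float, float]],
    delta: float,
    rng: random.Random,
) -> List[int]:
    """Simulate responses in which each item depends on the one before it."""
    responses = []
    prev = 0  # the first item has no predecessor
    for a, b in items:
        x = 1 if rng.random() < p_correct(theta, a, b, delta, prev) else 0
        responses.append(x)
        prev = x
    return responses


# Usage: 30 hypothetical items, delta = 2 (a learning-type effect).
vector = simulate_dependent(0.0, [(1.0, 0.0)] * 30, delta=2.0, rng=random.Random(11))
```

Under this assumption a positive $\delta$ raises, and a negative $\delta$ lowers, the probability of a correct response after a correct answer, which mirrors the qualitative behavior described in the text.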



In Table 2 the detection rates for the eight different CUSUM procedures are given for 
several types of nonfitting response behavior and normal response behavior. For the dataset 
of normal response patterns and for most statistics the percentage of simulees classified as 
nonfitting was around the expected 0.05. For $T_3$, however, the percentage of simulees classified 
as nonfitting was 0.13. 

Furthermore, the detection rates for guessing simulees were high for most statistics, except 
for $T_5$ and $T_8$; for example, the detection rate was 0.19 for $T_5$, whereas for $T_7$ it was 0.97. 

For most statistics the detection rates for non-invariant abilities were lower than for 
guessing. For $\rho = 0$, the detection rates were approximately 0.30 for most statistics except 
for $T_3$ (0.13) and $T_7$ (0.13). For $\rho = 0.5$ the detection rates were approximately 0.20 for most 
statistics, except for $T_3$ and $T_7$ (0.11 for both). 

Furthermore, Table 2 shows that, for violations of local independence, the highest detection 
rates occurred for $\delta = 2$. For $\delta = 1$ the detection rates were approximately 0.10 for all statistics, 
whereas for $\delta = 2$ the detection rates were approximately 0.30 for most statistics, except for $T_1$ 
(0.22) and $T_7$ (0.18). 

Discussion 

The results of Study 1 showed that the bounds of the CUSUM procedures were rather 
stable across $\theta$-values for all statistics except $T_7$. As a result, for most statistics, one fixed UB 
and LB can be used as thresholds for the CUSUM procedures. In Study 2 these fixed bounds 
were used to determine the detection rates for the eight CUSUM procedures. Results showed 
that for statistic $T_3$ the percentage of normal response patterns classified as nonfitting (0.13) 
was larger than the expected 0.05; thus, the CUSUM procedure using statistic $T_3$ might result 
in a liberal classification of nonfitting response behavior. 

Because there are few other studies using person fit in a CAT, it is difficult to compare 
the detection rates with those of other studies. An alternative is to compare the results with results from 
studies using P&P data. For example, despite differences in simulating the data, the results of 
this study were similar to the results of Zickar and Drasgow (1996). In the Zickar and Drasgow 
(1996) study, real data were used in which some examinees were instructed to distort their 
responses; detection rates between 0.01 and 0.32 were found for P&P data. 

This study showed that the detection rates were dependent on the type of aberrance 
simulated. It was shown that most statistics were sensitive to random response behavior to 
all items in the test. It was also shown that most statistics were sensitive to response behavior 
when simulees used a two-dimensional $\theta$. None of the proposed statistics was shown to be 
sensitive to local dependence between all items in the test when the probability of a correct 
response decreased during the test (negative values of $\delta$); an example of this behavior is an 
examinee who becomes more and more careless over the course of the test because he or she becomes tired. 
However, most statistics were shown to be sensitive to violations of local independence when 
the probability of a correct response increased during the test (positive values of $\delta$); for example, 
when previous items provide new insights that are useful for answering the next item. 

Detection rates were low for some types of nonfitting response behavior. Testing the null 
model of normal response behavior against an alternative model of specific nonfitting response 
behavior may be a suitable alternative. 

The proposed CUSUM procedure using statistics $T_1$ through $T_4$ can be used in an on-line 
situation in which the statistics are computed during test administration. One application of person 
fit in an on-line situation is to take action when an examinee is getting 'out of control'. Then, 
items from a different item pool containing new items can be selected to prevent inflated test 
scores due to item preknowledge. 
Table 1. Bounds of CUSUM using statistics $T_1$ through $T_8$. 
Table 2. Detection rates of CUSUM procedures. 